Skip to content

fix: Backport scheduler changes to cloud/v1.32.0-157#10608

Open
davidporter-id-au wants to merge 8 commits into
temporalio:cloud/v1.32.0-157from
davidporter-id-au:cherry-pick/v1.32.0-157/scheduler-backports
Open

fix: Backport scheduler changes to cloud/v1.32.0-157#10608
davidporter-id-au wants to merge 8 commits into
temporalio:cloud/v1.32.0-157from
davidporter-id-au:cherry-pick/v1.32.0-157/scheduler-backports

Conversation

@davidporter-id-au

Copy link
Copy Markdown
Contributor

Backports a series of scheduler changes onto cloud/v1.32.0-157. Cherry-picked in chronological order from main (applied cleanly as a stack; no conflicts when ordered correctly).

Commits (oldest → newest)

Upstream PR Description
#10348 Add percentage-based dialup for creation and migration
#10439 Adds scheduleIdleCloseTime
#10517 Plumb Tweakables through Scheduler's CHASM context values, use in EventLog
#10369 Rework and instrument CHASM scheduler tasks, add coverage
#10530 Log decisions and task events via the EventLog
#10505 Add a flag to allow only migration of schedules without running workflows
#10406 Adding scanner for invariants for schedules v2
#10503 Add scheduler observability metrics for missed actions

Note: #10390 (next-action-time) was already present on the target branch and is omitted.

Builds clean (go build over affected packages).

🤖 Generated with Claude Code

lina-temporal and others added 8 commits June 8, 2026 12:33
…emporalio#10348)

## What changed?
- Gates CHASM scheduler creation and migration against a new rollout
percentage, based on a stable hash of namespace + business ID.
- Defaults to 100%, to preserve existing rollouts (rollout tooling will
be updated to also set these keys)
- Added a RolloutApplies helper, since we'll likely want to do this for
other features going forward

## Why?
- We want finely-grained control for rollouts even within a namespace.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

(cherry picked from commit d4e4958)
## What changed?
This adds ScheduleCloseTime, the intent being it helps us identify
scheduler problems through automatic scanners

## Why?

There's a few cases where it's possible for the scheduler state-machines
to transition to invalid states if bugs are present. Looking for the
boundary at when we expect schedulers to fire helps us look for these
invalid states.

## How did you test it?
- [ ] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Should be fairly minimal, as again, this isn't a significant increase in
the number of writes to visibility

(cherry picked from commit 32984de)
…, use in EventLog (temporalio#10517)

## What changed?
- Tweakables is now available for access within any Scheduler function
that has a CHASM context.
- EventLog now uses the Tweakables from CHASM context to determine
dynamic config values.

## Why?
- It was looking real ugly to plumb through those numbers and that
dynamic config everywhere else throughout.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

(cherry picked from commit 6fc519f)
…of coverage (temporalio#10369)

Reworks CHASM scheduler's tasks to harden them against being silently
dropped, and fixes several edge-case bugs we've recently hit. Most of
these only surface in combination, so the fixes land together.

The logging here is just a few breadcrumbs — once the EventLog PR merges
I'll do a full logging/eventlogging pass, so don't read too much into
the individual log statements.

## What changed

### Generator
- `GeneratorTask` now keeps running while paused (advances the HWM,
updates `FutureActionTimes`, drops actions at the source) instead of
stopping — an improvement over V1.
- New `isManualOnly` / `isHeldOpen` helpers carve out schedules with no
automated wakeup source (empty spec, manual-only, pending backfill) so
the idle timer can't silently close them.

### Idle task
- Reworked around `isHeldOpen` plus an `.After()` deadline compare, so
sub-precision drift no longer drops tasks that should still fire.
- Fixes a non-monotonicity bug in `recentActions` (a closing workflow
could regress observed time).

### Invoker
- `ExecuteTask` captures the framework clock at read time and threads it
through `applyBackoff`, keeping `BackoffTime` consistent with
`LastProcessedTime` (no more wall-clock drift in tests/replay).
- `ProcessBufferTask` checks the catchup window before consuming a
`LimitedActions` slot.
- `recordExecuteResult` is idempotent per `RequestId`: a duplicate
`CompletedStart` / `RetryableStart` whose live entry already has a
`RunId` is dropped, without stomping `RunId`/`StartTime`/`HasCallback`
or rewinding `Attempt`/`BackoffTime`.
- `addTasks` is consolidated into a single-pass classifier; the
broadened re-arm gate fixes the stranded-retry bug.

### Backfiller
- `hasMoreAllowAllBackfills` was a misreading of the original code and
is gone — any pending backfill (not just `ALLOW_ALL`) holds a paused
schedule open while idling out.
- On completion, the backfiller revives the Generator so it can
re-evaluate idle/close eligibility.

## Tests

This PR carries only the minimal tests that go red→green with the fixes:

- **Functional** (`tests/schedule_test.go`): `TestPausedDropsCatchup`
and `TestFutureActionTimesAdvanceWhilePaused` — both fail on `main`; the
always-on Generator (advancing the HWM while paused) is what turns them
green.
- **Unit** (`chasm/lib/scheduler/`): targeted guards for each invoker
fix — idempotency, framework-clock, eligibility boundary, re-arm, and
catchup-before-limit — plus the modifications to pre-existing
idle/sentinel tests the rework requires.

Additional regression testing will live in a separate PR.

## Risks

This touches a fair amount of scheduler surface area, but the fixes are
individually validated and the diff is now mostly the rework itself.

(cherry picked from commit e9ae575)
…o#10530)

## What changed?
- Added internal EventLog entries for task processing and engine
decision points.

## Why?
- Gives us a nice audit log of what's changing with scheduler at
specific timestamps, akin to V1's event history.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
- Every write requries access to a CHASM mutableContext, which implies a
write will already occur on said component (this is how CHASM's
dirty-state tracking works). Additionally, all event logs have strict
upper bounds in length and individual item sizes.

(cherry picked from commit 9e389ce)
…ut running workflows (temporalio#10505)

## What changed?
- Added a new dynamic config flag to control whether V1 schedules will
migrate with running workflows.

## Why?
- When V1 schedules migrate to V2 with running workflows, V2 attaches a
completion callback to the running workflow (as V2 exclusively uses
Nexus completion callbacks to monitor workflow status). Certain
third-party SDKs are known to have issues with one of the events written
to the history of the running workflow, which can cause them to panic
(Coinbase SDK). This allows us to dial up migration for these customers
without triggering that edge case.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
- Slower migration cadence, potentially

(cherry picked from commit a046c79)
## What changed?

This introduces an optional scanner for looking through v2 schedules and
to perform some invariant checks and metric checks. The theory is that:
- We should generally not see schedules with the nextActionTime in the
past, and for those that do, they should be paused or be waiting for the
workflow to fire.
- We should not see workflows that are closed that are still around for
a long period of time
- We should expect (eventually, after migration and rollout of
schedules-v2) all workflows to fit into the category of closed or
running.

Some of these assertions are possibly a bit premature as my
understanding of the system is still ramping up, but it's relatively low
risk, all the scanner does is emit a log or metric.

## Why?

The intent is to just get a better handle on the schedules feature as it
rolls out, owing to its criticality. It's hard to test for all
edge-cases ahead of time, so looking for invariants in production can be
illuminating.

## How did you test it?
- [ ] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)

For testing, I intentionally added a bug in the scheduler to hang the
schedule after firing and then let it run and added lots of loud log
messages so I could see if the scanner was able to detect it, which it
was.

This does *not* cover the cases where permissions might differ across
namespaces, I still need to investigate that a bit more.

## Potential risks

Probably fairly low:
- A serious bug could cause a hot-loop of continue-as-new workflows if
the activities were both misconfigured and unable to start, but
otherwise it's unlikely to do much.
- A misconfiguration in the rate-limit might cause visibility to get
somewhat hammered by the scanner, generating some visibility load.

(cherry picked from commit 824d2d0)
…lio#10503)

## Summary

Add and improve scheduler metrics to detect missed actions caused by
system instability, slow callback delivery, and overlap policy behavior.
Changes apply to both v1 (workflow-based) and v2 (CHASM) schedulers.

## What is a missed action?

A scheduled action is "missed" when the scheduler decides not to execute
an action that was due according to the schedule spec. There are three
scenarios where this happens:

1. **Catchup window exceeded.** The scheduler was delayed in processing
and the action's scheduled execution time is now older than the
configured catchup window. The action is permanently dropped. This can
be caused by a previous action still running or by system instability
(worker downtime, shard movement, task queue backlog). Tracked by
`schedule_missed_catchup_window`.

2. **Overlap policy skip.** A previous action is still running and the
overlap policy is SKIP (or BUFFER_ONE with a full buffer). The scheduler
intentionally drops the action because the policy says not to start
concurrent runs. This is expected behavior, not a system issue -- but a
high skip rate may indicate action duration exceeds the schedule
interval. Tracked by `schedule_overlap_skipped`.

3. **Buffer overrun.** The internal action buffer is full (bounded by
`MaxBufferSize`). New actions are dropped regardless of overlap policy.
Tracked by `schedule_buffer_overruns`.

Scenarios 1 and 2 are the focus of this PR. Scenario 3 already had
metrics.

### Catchup window vs. overlap policy: v1 and v2 differences

An action is only ever counted in one of the above scenarios, but the
evaluation order and the possible outcomes differ between v1 and v2.

Both v1 and v2 have the same two-phase structure: a generation phase
that checks the catchup window and buffers actions, followed by a buffer
processing phase that evaluates overlap policy and executes actions. The
difference is in what happens to buffered actions that can't be executed
immediately.

**V1 (workflow-based):**

1. `processTimeRange` (generation) checks the catchup window. Actions
whose nominal time is older than the window are dropped
(`schedule_missed_catchup_window{reason="not_buffered"}`). Surviving
actions are added to the buffer.
2. `processBuffer` (invocation) refreshes running action state via
`PollMutableState`, then evaluates overlap policy on the entire buffer,
including actions deferred on previous ticks. Actions are either
started, skipped by overlap policy (`schedule_overlap_skipped`), or
deferred (BUFFER_ONE/BUFFER_ALL).

V1 has no `reason=buffer_expired` case because it never re-checks the
catchup window on buffered actions. Once an action passes the initial
catchup window check and enters the buffer, it will never be dropped for
being too old. If a BUFFER_ONE schedule has a long-running action, the
deferred action waits indefinitely across ticks until the running action
completes, at which point it is started regardless of how much time has
passed since its original schedule time.

**V2 (CHASM):**

1. The generator task checks the catchup window. Same as v1 -- actions
past the window are dropped
(`schedule_missed_catchup_window{reason="not_buffered"}`). Surviving
actions are buffered.
2. The invoker's `ProcessBuffer` task evaluates overlap policy. Actions
are either marked for execution, skipped by overlap
(`schedule_overlap_skipped`), or deferred (BUFFER_ONE with a running
action sets `Attempt = -1`).
3. Actions that survive overlap policy evaluation are checked against
the catchup window a second time. If the action was buffered in time but
has now expired -- due to the gap between CHASM tasks, overlap policy
deferral, or start retries -- it is dropped
(`schedule_missed_catchup_window{reason="buffer_expired"}`).

This second deadline check is unique to v2. It means a BUFFER_ONE action
that waits too long for a slot can be dropped, unlike v1 where it would
wait indefinitely. The `action_running` tag on `reason=buffer_expired`
distinguishes whether the expiration was caused by a long-running action
blocking the slot (`action_running=true`) or by system delay in CHASM
task processing (`action_running=false`). For starts deferred by
BUFFER_ONE/BUFFER_ALL, `action_running=true` is preserved when the start
expires because it waited behind a running action, even if that action
has completed by the time the metric is emitted.

## New metrics

### `schedule_overlap_skipped` (counter)

**Tags:** `namespace`, `scheduler_backend`, `schedule_overlap_policy`

Actions are skipped when the overlap policy prevents starting a new
action while a previous one is still running. Today this is only tracked
as a running total in `Info.OverlapSkipped` (visible via
DescribeSchedule) with no time-series visibility.

This counter is emitted for each action dropped by `ProcessBuffer` due
to overlap policy. The `schedule_overlap_policy` tag carries the overlap
policy that was actually applied when that action was evaluated (e.g.,
`SCHEDULE_OVERLAP_POLICY_SKIP` or `SCHEDULE_OVERLAP_POLICY_BUFFER_ONE`).

One subtlety is that buffered actions do not all resolve overlap policy
the same way:

- **v1:** normal scheduled actions are buffered with `UNSPECIFIED` and
resolved later in `ProcessBuffer` against the schedule's current policy.
If `UpdateSchedule` changes the overlap policy while actions are already
buffered, those existing buffered scheduled actions will be evaluated
under the new policy.
- **v2:** the generator resolves overlap policy before buffering and
stores the resolved policy on each `BufferedStart`. If `UpdateSchedule`
changes the overlap policy later, already-buffered scheduled actions
keep the policy they were generated with.

This distinction is why the metric is emitted per skipped action using
the policy that was actually applied at evaluation time, rather than
once per `ProcessBuffer` pass with a single schedule-level tag.

**How to use:**
- `rate(schedule_overlap_skipped[5m])` shows how often actions are being
dropped due to overlap.
- Group by `schedule_overlap_policy` to see which policies are causing
the most skips.
- Compare with `schedule_action_e2e_delay` to see whether skipped
actions correlate with actions whose duration exceeds the schedule
interval, pushing the total schedule-to-start delay higher.

Emitted in v1 (`processBuffer`) and v2
(`InvokerProcessBufferTaskHandler.Execute`).

### `schedule_callback_latency` (timer, v2 only)

**Tags:** `namespace`, `scheduler_backend`

The v2 scheduler learns about action completions via Nexus completion
callbacks. This timer records `time.Since(closeTime)` when
`HandleNexusCompletion` processes a callback, measuring the end-to-end
delay from action completion to the scheduler receiving the result.

This latency includes: callback task creation, callback task queue
processing, cross-shard RPC delivery, and handler execution.

**Why v2 only:** The v1 and v2 schedulers use fundamentally different
mechanisms to discover action completions, and only v2's is worth
measuring independently.

In v1, completion discovery is **tick-driven**. The scheduler wakes on
each scheduled time and does a synchronous `PollMutableState` refresh of
all running actions before making overlap decisions. For a SKIP schedule
with a 1s interval, the scheduler discovers completions at most 1s after
they happen (on the next tick). The long-poll watcher activity exists
but is only useful for BUFFER_ONE/BUFFER_ALL where starts sit in the
buffer waiting for a completion -- for SKIP, there are no buffered
starts and the watcher is irrelevant. The completion discovery latency
in v1 is inherently bounded by the schedule interval itself, so there's
nothing additional to measure.

In v2, completion discovery is **event-driven** via Nexus callbacks. The
scheduler has no periodic refresh; it only learns about completions when
the callback is delivered. This means callback delivery latency directly
gates when the next action can start and when overlap policies are
re-evaluated. A slow callback pipeline silently degrades scheduler
throughput in a way that has no v1 equivalent.

**How to use:**
- High values indicate the callback delivery pipeline is slow, which
directly delays the next action for schedules using BUFFER_ONE or
BUFFER_ALL policies.
- Compare with `schedule_action_delay` to understand how much of the
total action delay is attributable to callback delivery vs. other
factors.
- Useful for diagnosing scenarios where v2 schedules appear to "stall"
between actions.

Emitted in `Scheduler.HandleNexusCompletion`.

## Modified metrics

### `schedule_generate_latency` (timer) -- added to v1, fixed in both

Previously v2-only. This timer measures how late the scheduler is in
processing a due action: `now - scheduledExecutionTime` (the action's
`ActualTime`, after jitter).

**Changes:**
1. **Added to v1 scheduler.** The v1 `processTimeRange` now emits this
metric, matching the existing v2 behavior.
2. **Emit only for the first action per batch.** When the scheduler
catches up over a large time range (e.g., after being down), it
processes many due actions in a single pass. Previously every action got
its own latency sample against the same `now`. Now only the first action
is recorded, representing the actual system delay.
3. **Moved before the catchup window check.** Previously (v2) the metric
was emitted after the catchup window check, meaning the worst latencies
-- the ones that actually miss the window -- were never recorded. Now
it's emitted before, so all latencies including missed actions are
captured.

**How to use:**
- `histogram_quantile(0.99, schedule_generate_latency)` shows worst-case
scheduling delay.
- A rising trend is an early warning of system instability before
actions start being dropped.

### `schedule_missed_catchup_window` (counter) -- added `reason` and
`action_running` tags

Actions can miss the catchup window at two points in the v2 scheduler:
1. **Generator**: the action's scheduled execution time is older than
the catchup window when the generator processes the time range. The
action is never buffered.
2. **Invoker (process buffer)**: the action was buffered in time but
then sat too long waiting for execution (e.g., due to overlap policy
deferral, start retries, or system delay between the gen task and
process buffer task).

Previously both cases emitted the same untagged counter, making it
impossible to distinguish root cause. Now the counter carries two new
tags:

**`reason` tag:**
- `reason=not_buffered` -- the action's scheduled execution time was
already past the catchup window when the generator processed the time
range. It was never buffered for execution. Present in both v1 and v2.
- `reason=buffer_expired` -- the action was buffered in time but expired
before execution (e.g., due to overlap deferral, retries, or system
delay between CHASM tasks). V2 only.

V1 only has the `not_buffered` case because generation and buffer
processing happen synchronously in the same workflow task -- there is no
gap where a buffered start can sit and expire.

**`action_running` tag:**
- `action_running=true` -- a previous action was running when the
buffered start was deferred, or was still running when the start was
dropped. This indicates overlap wait contributed to the missed action.
- `action_running=false` -- no action was blocking the start when it
expired, so the miss was due to other factors (system delay, scheduler
processing time).

Present only on `reason=buffer_expired` (v2). Not added to
`reason=not_buffered` because: in v2, the generator doesn't have access
to the invoker's running action state; in v1, running action state is
stale at the point where the catchup window check runs (the refresh
happens later in `processBuffer`).

**How to use:**
- `schedule_missed_catchup_window{reason="not_buffered"}` means the
scheduler was too late to process the action. This check is independent
of whether a previous action is running.
- `schedule_missed_catchup_window{reason="buffer_expired"}` indicates
actions were buffered on time but the invoker was delayed in executing
them. V2 only.
- `schedule_missed_catchup_window{reason="buffer_expired",
action_running="true"}` narrows the cause to a long-running previous
action blocking execution of buffered starts, including starts that
expire after the previous action completes but before they can be
executed. For v2 it's possible that completion callbacks are not being
delivered in time.
- `schedule_missed_catchup_window{reason="buffer_expired",
action_running="false"}` means no previous action was blocking execution
-- the delay was in CHASM task processing between the generator and
invoker.
- Comparing the two rates shows whether the bottleneck is in waking the
scheduler vs. executing buffered actions.

## Existing metrics (not modified, included for context)

### `schedule_action_delay` (timer)

**Tags:** `namespace`, `scheduler_backend`, `schedule_action`

Measures the time between when an action became eligible for execution
and when it was actually started: `actualStartTime - desiredTime`.

`desiredTime` is either:
- The action's scheduled execution time (`ActualTime`), for the first
action after the scheduler wakes up.
- The `CloseTime` of the previous action, for subsequent buffered starts
(set in `processWatcherResult` in v1, `recordCompletedAction` in v2).
This means that for BUFFER_ONE/BUFFER_ALL, the delay measures only the
gap between the previous action completing and the next one starting,
not the total wait from the original schedule time.

This metric captures the delay from "action eligible" to "action
started," which includes buffer processing, rate limiting, and the
actual start RPC. However, it does **not** capture time spent waiting
for a previous action to complete under BUFFER_ONE/BUFFER_ALL overlap
policies, because `DesiredTime` is reset to the previous action's
`CloseTime` when it completes (in `processWatcherResult` for v1,
`recordCompletedAction` for v2). This means for buffered starts,
`schedule_action_delay` measures only the gap between the slot opening
and the next action starting -- not the total wall-clock wait from the
original schedule time.

For the full wall-clock delay including overlap policy wait, use
`schedule_action_e2e_delay`.

**Relationship to other metrics:**
- `schedule_generate_latency` measures a subset: how late the scheduler
was in waking up to process the action. This is always <=
`schedule_action_delay`.
- `schedule_callback_latency` (v2) measures a different leg: how long it
took to learn about the previous action's completion. For BUFFER_ONE,
high callback latency contributes to high action delay on the next
start.
- `schedule_action_delay` is the metric most directly visible to users
as "how fast did we start after the slot opened."
- `schedule_action_e2e_delay` is the metric most directly visible to
users as "how late was my scheduled action" -- it includes overlap
policy wait time that `schedule_action_delay` excludes.

### `schedule_action_e2e_delay` (timer) -- NEW

**Tags:** `namespace`, `scheduler_backend`, `schedule_action`

Measures the total wall-clock delay from the action's original schedule
time to when it was actually started: `actualStartTime - ActualTime`.
Unlike `schedule_action_delay`, this metric always uses the original
schedule time as the baseline and is never reset when a previous action
completes.

This means for BUFFER_ONE/BUFFER_ALL schedules where an action was
waiting in the buffer for a previous one to finish, this metric includes
the full overlap policy wait time. For the first action after the
scheduler wakes (where `DesiredTime` has not been reset), this metric
will equal `schedule_action_delay`.

**How to use:**
- Compare with `schedule_action_delay` to isolate how much delay is
caused by overlap policy wait vs. other factors. If
`schedule_action_e2e_delay` is high but `schedule_action_delay` is low,
the bottleneck is the previous action's duration.
- `histogram_quantile(0.99, schedule_action_e2e_delay)` shows worst-case
total delay from schedule time to actual start, as experienced by users.

Emitted in v1 (`startWorkflow`) and v2
(`InvokerExecuteTaskHandler.startWorkflow`).

## Detecting missed/delayed actions due to system instability

These metrics form a pipeline from early warning to confirmed impact.
The available signals differ between v1 and v2 due to their
architectural differences.

### V1 (workflow-based scheduler)

The v1 scheduler runs as a single workflow. Generation and buffer
processing happen synchronously in the same workflow task. The scheduler
discovers action completions via a synchronous `PollMutableState`
refresh on each tick.

**Available metrics:** `schedule_generate_latency`,
`schedule_action_delay`, `schedule_action_e2e_delay`,
`schedule_missed_catchup_window{reason="not_buffered"}`,
`schedule_overlap_skipped`

**Diagnosis flow:**

1. **Early warning: `schedule_generate_latency`**
The first metric to move. Measures how late the scheduler is in
processing due actions. Under normal conditions this should be
near-zero. A rising trend means the scheduler is delayed in processing
-- the worker is under load, matching is slow, or the task queue is
backed up.

   ```promql
histogram_quantile(0.99,
rate(schedule_generate_latency_bucket{scheduler_backend="legacy"}[5m]))
   ```

2. **Confirmed drops:
`schedule_missed_catchup_window{reason="not_buffered"}`**
Once generate latency exceeds the catchup window, actions are
permanently lost. In v1 this is always `reason=not_buffered` --
generation and buffer processing are synchronous, so there is no gap
where a buffered start can expire independently.

   ```promql
rate(schedule_missed_catchup_window{scheduler_backend="legacy",
reason="not_buffered"}[5m])
   ```

3. **User-facing impact: `schedule_action_delay` and
`schedule_action_e2e_delay`**
`schedule_action_delay` measures the delay from when the action became
eligible (after any overlap wait) to when it started. Absent high
`schedule_generate_latency`, a high `schedule_action_delay` points to
buffer processing -- rate limiting or the start RPC itself.

`schedule_action_e2e_delay` measures total delay from the original
schedule time, including overlap policy wait. Absent high
`schedule_action_delay`, a high `schedule_action_e2e_delay` indicates
the bottleneck is previous action duration, not system instability.

4. **Overlap context: `schedule_overlap_skipped`**
Absent high `schedule_callback_latency` (v2), a high skip rate suggests
actions are taking longer than the schedule interval (application-level
issue). In v2, delayed callback delivery can also cause skips -- the
scheduler still sees the previous action as running even though it has
completed, because the completion callback hasn't arrived yet.

### V2 (CHASM scheduler)

The v2 scheduler splits work across independent CHASM tasks: a generator
task buffers actions, and an invoker process-buffer task resolves and
executes them. The scheduler learns about action completions via Nexus
callbacks, not polling. This creates additional failure modes between
the generator and invoker, and in the callback delivery pipeline.

**Available metrics:** `schedule_generate_latency`,
`schedule_action_delay`, `schedule_action_e2e_delay`,
`schedule_missed_catchup_window{reason="not_buffered"|"buffer_expired"}`,
`schedule_callback_latency`, `schedule_overlap_skipped`

**Diagnosis flow:**

1. **Early warning: `schedule_generate_latency`**
Same as v1 -- measures how late the generator task is in processing due
actions.

   ```promql
histogram_quantile(0.99,
rate(schedule_generate_latency_bucket{scheduler_backend="chasm"}[5m]))
   ```

2. **Confirmed drops: `schedule_missed_catchup_window`**
   The `reason` tag identifies where the drop occurred:

- `reason=not_buffered`: The generator was too late to buffer the
action. Same as v1 -- the clearest signal that the system is to blame.
- `reason=buffer_expired`: The action was buffered on time but expired
waiting for execution. The `action_running` tag distinguishes cause:
- `action_running=true`: A previous action was still running, or had
deferred the buffered start long enough that it expired before
execution. Absent high `schedule_callback_latency`, this points to
action duration exceeding the catchup window (application-level issue).
If `schedule_callback_latency` is also high, delayed callback delivery
may be the actual cause -- the previous action completed but the
scheduler hasn't learned about it yet.
- `action_running=false`: No action was blocking it. The delay was in
CHASM task processing between the generator and invoker -- system
instability.

   ```promql
   # system-caused drops at the generator (strongest signal)
rate(schedule_missed_catchup_window{scheduler_backend="chasm",
reason="not_buffered"}[5m])

   # system-caused drops at the invoker (no action blocking)
rate(schedule_missed_catchup_window{scheduler_backend="chasm",
reason="buffer_expired", action_running="false"}[5m])

   # drops caused by long-running actions
rate(schedule_missed_catchup_window{scheduler_backend="chasm",
reason="buffer_expired", action_running="true"}[5m])
   ```

3. **Callback pipeline health: `schedule_callback_latency`**
For schedules using BUFFER_ONE or BUFFER_ALL, the callback pipeline is
on the critical path between one action completing and the next
starting. Absent high `schedule_generate_latency`, a high
`schedule_callback_latency` indicates the callback delivery pipeline is
the bottleneck. A slow callback pipeline means the scheduler doesn't
learn about completions promptly, delaying the next buffered action and
potentially causing overlap skips or buffer expirations.

   ```promql
   histogram_quantile(0.99, rate(schedule_callback_latency_bucket[5m]))
   ```

4. **User-facing impact: `schedule_action_delay` and
`schedule_action_e2e_delay`**
`schedule_action_delay` measures delay from when the action became
eligible to when it started. `schedule_action_e2e_delay` measures total
delay from the original schedule time. If the latter is high, work
backwards:

- Is `schedule_generate_latency` high? -> Generator task is delayed
(system instability).
- Is `schedule_callback_latency` high? -> Completion callbacks are slow
(system instability in the callback pipeline).
- Is `schedule_missed_catchup_window{reason="buffer_expired",
action_running="false"}` firing? -> Invoker task is delayed (system
instability between CHASM tasks).
- Is `schedule_action_e2e_delay` high but `schedule_action_delay` low?
-> Previous action duration is the bottleneck (application-level issue).
- Is `schedule_overlap_skipped` spiking? -> Overlap policy is dropping
actions because previous actions run too long (application-level issue).

5. **Overlap context: `schedule_overlap_skipped`**
Same as v1. Absent high `schedule_callback_latency`, a high skip rate
suggests action duration exceeds the schedule interval
(application-level issue). If `schedule_callback_latency` is also high,
the skips may be caused by delayed completion callbacks rather than
actual action duration.

## Detecting missed/delayed actions due to schedule configuration

Actions can be missed or delayed due to the schedule's own configuration
-- for example, when action duration exceeds the schedule interval.
These are not caused by system instability but may still require
attention.

**Symptoms:**
- `schedule_overlap_skipped` is elevated, absent high
`schedule_callback_latency` (v2). Actions are being dropped because the
previous action hasn't completed by the time the next one is due.
- `schedule_action_e2e_delay` is high but `schedule_action_delay` is
low. The total delay from schedule time to start is large, but the
scheduler started the action promptly once the slot opened. The gap is
overlap policy wait.
- `schedule_missed_catchup_window{reason="buffer_expired",
action_running="true"}` is firing (v2 only), absent high
`schedule_callback_latency`. Buffered actions are expiring because they
spent too long waiting behind a previous action.

**Diagnosis:**
1. Compare `schedule_action_e2e_delay` with `schedule_action_delay`. A
large gap between them indicates time spent waiting for overlap policy
to allow execution.
2. Check `schedule_overlap_skipped` grouped by
`schedule_overlap_policy`. A high rate on SKIP or BUFFER_ONE indicates
the schedule interval is shorter than the action duration.
3. For BUFFER_ONE/BUFFER_ALL schedules, check whether
`schedule_missed_catchup_window{reason="buffer_expired",
action_running="true"}` is firing. This means buffered actions are
waiting so long for the slot that they exceed the catchup window.

**Resolution:** These issues are addressed by changing the schedule
configuration -- increasing the interval, switching to a more permissive
overlap policy, or reducing action duration. They do not indicate a
problem with the scheduler or the platform.

### `schedule_buffer_overruns` (counter)

**Tags:** `namespace`, `scheduler_backend`

Counts actions dropped because the internal action buffer is full
(bounded by `MaxBufferSize`).

Emitted in v1 (`processTimeRange` and signal handling) and v2
(`GeneratorTaskHandler.Execute`).

## Backfills and manual triggers

Backfills and manual triggers (TriggerImmediately) generate actions with
the `Manual` flag set. Each backfill request can specify its own overlap
policy -- if unspecified, it resolves to the schedule's configured
policy. Backfill requests can use ALLOW_ALL to run all backfilled
actions concurrently.

**Not emitted for manual/backfill actions:**
- `schedule_generate_latency` -- backfill actions are generated against
a historical time range, so `now - ActualTime` is meaningless as a
system delay indicator.
- `schedule_missed_catchup_window` -- backfill actions skip the catchup
window check entirely. The time range is intentionally in the past.
- `schedule_action_delay` and `schedule_action_e2e_delay` -- the delay
between a historical schedule time and now is not meaningful for
backfills.

**Emitted for manual/backfill actions:**
- `schedule_overlap_skipped` -- backfill actions go through
`ProcessBuffer` and are subject to overlap policy. A backfill with SKIP
policy against a running action will increment this counter. The
`schedule_overlap_policy` tag reflects the backfill's own policy, not
the schedule's configured policy.
- `schedule_buffer_overruns` -- backfills can fill the action buffer,
especially large backfills over wide time ranges. In v1, backfills are
limited to half the buffer (`MaxBufferSize/2`) and processed
incrementally (`BackfillsPerIteration` per tick). In v2, each backfill
request creates a separate `Backfiller` component that generates actions
independently.
- `schedule_callback_latency` (v2) -- recorded when any action's
completion callback arrives, regardless of whether it was manually
triggered.

**Interaction with regular actions:** Whether backfills contend with
regular actions depends on the backfill's overlap policy:

- **ALLOW_ALL (typical for backfills):** Actions started with ALLOW_ALL
are not tracked as running. This means they do not trigger overlap
policy evaluation on regular actions -- a SKIP schedule will continue to
start regular actions normally even while ALLOW_ALL backfill actions are
running. Buffer contention is the only possible interference: in v1,
backfills self-limit to half the buffer (`MaxBufferSize/2`) and add at
most `BackfillsPerIteration` (default 10) per tick, with the generator
running first each tick. Buffer contention is unlikely unless
backfill-started actions from previous ticks accumulate in the buffer
faster than they can be executed.

- **Non-ALLOW_ALL (SKIP, BUFFER_ONE, etc.):** Actions started with these
policies are tracked as running. A backfill-started action that is still
running will cause regular actions to be skipped or deferred by overlap
policy. This can show up as elevated `schedule_overlap_skipped` or
increased `schedule_action_e2e_delay` on regular actions while the
backfill is in progress. This is the primary contention scenario -- the
backfill's running actions block the schedule's normal execution.

## Glossary

- **Scheduled action**: The operation the scheduler executes at each
scheduled time. Currently this is always starting a workflow, but the
scheduler is designed to support other primitives (e.g., activities) in
the future.

- **Nominal time** (`NominalTime`): The exact time that matches the
schedule spec, before jitter is applied. Used to generate the workflow
ID (ensuring deterministic naming) and the request ID. Corresponds to
`GetNextTimeResult.Nominal`.

- **Actual time** (`ActualTime`): The nominal time with jitter applied.
This is the time the action is actually scheduled to execute.
Corresponds to `GetNextTimeResult.Next`. Stored on
`BufferedStart.ActualTime`.

- **Desired time** (`DesiredTime`): The baseline time used to compute
`schedule_action_delay`. Initially nil. When a previous action
completes, `DesiredTime` on the next pending buffered start is set to
the previous action's `CloseTime`. If nil at metric emission time, falls
back to `ActualTime`. This reset means `schedule_action_delay` measures
the gap from "slot opened" to "action started" for buffered starts, not
the total wait from the original schedule time.

- **Catchup window**: A configurable duration (default: 365 days,
minimum: 10 seconds) that determines how old an action can be before it
is permanently dropped. If `now - ActualTime > catchupWindow`, the
action is considered too stale to execute. Checked in the generator
(both v1 and v2) and again in the v2 invoker for buffered actions.

- **Overlap policy**: Determines what happens when a new action is due
but a previous action is still running.
- **SKIP**: Drop the new action. No concurrent or queued actions. This
is the default policy.
- **BUFFER_ONE**: Queue at most one pending action. When the running
action completes, the queued action starts. If another action comes due
while one is already queued, the new one is dropped.
- **BUFFER_ALL**: Queue all pending actions. Every due action is queued
and will eventually execute in order after the running action completes.
- **CANCEL_OTHER**: Cancel the running action and start the new one
immediately.
- **TERMINATE_OTHER**: Terminate the running action and start the new
one immediately.

- **Generator**: The phase that iterates over the schedule spec's time
range, checks the catchup window, and buffers actions for execution. In
v1 this is `processTimeRange`; in v2 this is the `GeneratorTaskHandler`.

- **Invoker / buffer processing**: The phase that evaluates overlap
policy on the buffer and executes actions. In v1 this is
`processBuffer`; in v2 this is the `InvokerProcessBufferTaskHandler`
followed by `InvokerExecuteTaskHandler`.

- **Deferred action**: A buffered action that was evaluated by
`ProcessBuffer` but could not be executed because a previous action is
still running under BUFFER_ONE or BUFFER_ALL policy. The action stays in
the buffer and is re-evaluated on the next processing pass. In v2,
deferred actions are marked with `Attempt = -1` to distinguish them from
newly enqueued actions. In v1, deferred actions remain in the buffer
with their original state and are re-evaluated on each tick.

(cherry picked from commit 71dd8ed)
@davidporter-id-au davidporter-id-au changed the title Backport scheduler changes to cloud/v1.32.0-157 fix: Backport scheduler changes to cloud/v1.32.0-157 Jun 8, 2026
@davidporter-id-au davidporter-id-au marked this pull request as ready for review June 8, 2026 23:54
@davidporter-id-au davidporter-id-au requested review from a team June 8, 2026 23:54
@davidporter-id-au davidporter-id-au requested review from a team as code owners June 8, 2026 23:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants